Despite some successful applications of goal-driven navigation, existing deep reinforcement learning (DRL)-based approaches notoriously suffer from poor data efficiency. One reason is that the goal information is decoupled from the perception module and introduced directly as a condition of decision-making, so the goal-irrelevant features of the scene representation play an adversarial role during learning. In light of this, we present a novel Goal-guided Transformer-enabled reinforcement learning (GTRL) approach that takes the physical goal states as an input of the scene encoder, guiding the scene representation to couple with the goal information and enabling efficient autonomous navigation. More specifically, we propose a novel variant of the Vision Transformer as the backbone of the perception system, namely the Goal-guided Transformer (GoT), and pre-train it with expert priors to boost data efficiency. Subsequently, a reinforcement learning algorithm is instantiated for the decision-making system, taking the goal-oriented scene representation from the GoT as input and generating decision commands. As a result, our approach motivates the scene representation to concentrate mainly on goal-relevant features, which substantially enhances the data efficiency of the DRL learning process, leading to superior navigation performance. Both simulation and real-world experimental results demonstrate the superiority of our approach in terms of data efficiency, performance, robustness, and sim-to-real generalization, compared with other state-of-the-art baselines. Demonstration videos are available at https://youtu.be/93LGlGvaN0c.
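The core idea above is to feed the goal state into the scene encoder itself, so that attention is conditioned on the goal. A minimal single-head-attention sketch of this mechanism (the weight matrices, dimensions, and goal embedding are illustrative assumptions, not the paper's actual GoT architecture):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def got_encode(patch_tokens, goal_state, w_q, w_k, w_v, w_goal):
    """Single-head self-attention over patch tokens plus a goal token.

    Prepending an embedded goal token lets the attention weights depend on
    the goal, so the pooled scene representation is goal-conditioned rather
    than goal-agnostic.
    """
    goal_token = goal_state @ w_goal                 # embed goal into token space
    tokens = np.vstack([goal_token, patch_tokens])   # (1 + N, d)
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))   # goal-aware attention map
    out = attn @ v
    return out[0]  # goal token's output as the goal-oriented scene feature
```

With random weights, changing only the goal state already changes the produced scene feature, which is the coupling the abstract argues for.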
To meet the growing demand for labeled data and to address privacy concerns in human detection, synthetic data has been used as a substitute and has shown promising results in human detection and tracking tasks. We participated in the 7th Workshop on Benchmarking Multi-Target Tracking (BMTT), themed "How Far Can Synthetic Data Take Us?". Our solution, PieTrack, is developed based on synthetic data without using any pre-trained weights. We propose a self-supervised domain adaptation method that mitigates the domain-shift problem between synthetic data (e.g., MOTSynth) and real data (e.g., MOT17) without involving extra human labels. By leveraging the proposed multi-scale ensemble inference, we achieved a final HOTA score of 58.7 on the MOT17 test set, ranking third in the challenge.
Transformer-based supervised pre-training achieves good performance in person re-identification (ReID). However, due to the domain gap between ImageNet and ReID datasets, it usually requires a larger pre-training dataset (e.g., ImageNet-21K) to boost performance, because of the strong data-fitting ability of Transformers. To address this challenge, this work narrows the gap between the pre-training and ReID datasets from the perspectives of both data and model structure. We first investigate self-supervised learning of Vision Transformers (ViT) on unlabeled person images (the LUPerson dataset). To further reduce the domain gap and accelerate pre-training, a Catastrophic Forgetting Score (CFS) is proposed to evaluate the gap between pre-training and fine-tuning data. Based on CFS, a subset is selected by sampling relevant data close to the downstream ReID data and filtering irrelevant data out of the pre-training dataset. For the model structure, a ReID-specific module named IBN-based convolution stem (ICS) is proposed to bridge the domain gap by learning more invariant features. Extensive experiments have been conducted to fine-tune the pre-trained models under supervised learning, unsupervised domain adaptation (UDA), and unsupervised learning (USL) settings. We successfully downscale the LUPerson dataset to 50% with no performance degradation. Finally, we achieve state-of-the-art performance on Market-1501 and MSMT17. For example, our ViT-S/16 achieves 91.3%/89.9%/89.6% on Market-1501 for supervised/UDA/USL ReID. Code and models will be released at https://github.com/michuanhaohao/TransReID-SSL.
Large-scale cross-modal pre-training paradigms have recently shown ubiquitous success on a wide range of downstream tasks, e.g., zero-shot classification, retrieval, and image captioning. However, their successes rely heavily on the scale and quality of web-crawled data that naturally contain incomplete and noisy information (e.g., wrong or irrelevant content). Existing works either design manual rules to clean data or generate pseudo-targets as auxiliary signals for reducing noise impact, neither of which explicitly tackles the incorrect and incomplete challenges simultaneously. In this paper, to automatically mitigate the impact of noise by solely mining over existing data, we propose a principled Noise-robust Language-Image Pre-training framework (NLIP) to stabilize pre-training via two schemes: noise-harmonization and noise-completion. First, in the noise-harmonization scheme, NLIP estimates the noise probability of each pair according to the memorization effect of cross-modal transformers, then adopts noise-adaptive regularization to harmonize the cross-modal alignments with varying degrees. Second, in the noise-completion scheme, to enrich the missing object information of text, NLIP injects a concept-conditioned cross-modal decoder to obtain semantically consistent synthetic captions to complete noisy ones, using the visual concepts (i.e., objects' names) retrieved for the corresponding image to guide caption generation. By collaboratively optimizing the noise-harmonization and noise-completion schemes, our NLIP can alleviate the common noise effects during image-text pre-training in a more efficient way. Extensive experiments show the significant performance improvements of our NLIP using only 26M data over existing pre-trained models (e.g., CLIP, FILIP and BLIP) on 12 zero-shot classification datasets, MSCOCO image captioning and zero-shot image-text retrieval tasks.
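The noise-harmonization scheme amounts to down-weighting the alignment loss of pairs estimated to be noisy. A toy sketch of that weighting (the simple `1 - similarity` alignment loss and linear weighting are illustrative assumptions; NLIP's actual regularization and noise estimator are more involved):

```python
import numpy as np

def noise_adaptive_loss(sim_pos, noise_prob):
    """Down-weight the alignment loss of likely-noisy image-text pairs.

    sim_pos:    cosine similarity of each matched image-text pair.
    noise_prob: estimated probability the pair is mismatched (e.g., derived
                from the model's memorization dynamics during training).
    Clean pairs (low noise_prob) are aligned strongly; noisy pairs softly.
    """
    per_pair = 1.0 - sim_pos       # simple alignment loss: push similarity toward 1
    weights = 1.0 - noise_prob     # noise-adaptive weighting per pair
    return float(np.mean(weights * per_pair))
```

A pair flagged as likely noisy then contributes less to the total loss than the same pair treated as clean, which is the "varying degrees" of harmonization the abstract refers to.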
The attention mechanism plays an increasingly important role in point cloud analysis, and channel attention is one of its hotspots. With so much channel information, it is difficult for neural networks to screen out the useful channels. Therefore, an adaptive channel encoding mechanism is proposed in this paper to capture channel relationships. It improves the quality of the representations generated by the network by explicitly encoding the interdependencies among feature channels. Specifically, a channel-wise convolution (Channel-Conv) is proposed to adaptively learn the relationship between coordinates and features so as to encode the channels. Unlike popular weighting schemes, the proposed Channel-Conv realizes adaptivity within the convolution operation rather than simply assigning different weights to the channels. Extensive experiments on existing benchmarks verify that our method achieves state-of-the-art performance.
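The distinction drawn above is between one fixed weight per channel and weights derived per point from coordinates and features jointly. A minimal sketch of the latter idea (the sigmoid gating and single projection matrix are illustrative assumptions, not the paper's actual Channel-Conv operator):

```python
import numpy as np

def channel_conv(coords, feats, w):
    """Adaptive channel encoding for a point cloud.

    For each point, channel weights are computed from its coordinates and
    features jointly, so the re-encoding varies per point instead of using
    a single fixed weight per channel for the whole cloud.

    coords: (N, 3) point coordinates; feats: (N, C) features;
    w: (3 + C, C) learned projection (random here for illustration).
    """
    joint = np.concatenate([coords, feats], axis=1)  # (N, 3 + C)
    logits = joint @ w                               # (N, C) per-point channel logits
    gates = 1.0 / (1.0 + np.exp(-logits))            # sigmoid gates in (0, 1)
    return gates * feats                             # channel-wise re-encoding
```

Because the gates depend on each point's own coordinates and features, two points with identical features but different positions are encoded differently.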
Transformers play an increasingly important role in various computer vision areas and have also achieved remarkable results in point cloud analysis. Since existing methods mainly focus on point-wise Transformers, this paper proposes an adaptive channel-encoding Transformer. Specifically, a channel convolution is designed to encode the channels: it encodes the feature channels by capturing the latent relationship between coordinates and features. Compared with simply assigning an attention weight to each channel, our method aims to encode the channels adaptively. In addition, our network adopts a neighborhood-search method with low-level and high-level dual semantic receptive fields to improve performance. Extensive experiments show that our method outperforms state-of-the-art point cloud classification and segmentation methods on three benchmark datasets.
Recently, deep neural networks have achieved remarkable success in 3D point cloud classification. However, existing classification methods are mainly implemented on idealized point clouds and suffer from heavy performance degradation in non-ideal scenarios. To handle this problem, a feature-representation learning method named Dual Neighborhood Deep Fusion Network (DNDFN) is proposed to serve as an improved point cloud encoder for non-ideal point cloud classification tasks. DNDFN utilizes a trained neighborhood learning method called TN-Learning to capture key global neighborhoods. The global neighborhoods are then fused with local neighborhoods to help the network achieve stronger reasoning ability. In addition, an Information Transfer Convolution (IT-Conv) is proposed so that DNDFN can learn the edge information between point pairs, benefiting the feature-transfer process. The information transfer in IT-Conv resembles the propagation of information in a graph, which brings DNDFN closer to human-like reasoning. Extensive experiments on existing benchmarks, especially non-ideal datasets, verify that DNDFN achieves state-of-the-art results.
Benefiting from the intrinsic supervision information exploitation capability, contrastive learning has recently achieved promising performance in the field of deep graph clustering. However, we observe that two drawbacks of the positive and negative sample construction mechanisms limit the performance of existing algorithms from further improvement. 1) The quality of positive samples heavily depends on carefully designed data augmentations, and inappropriate data augmentations easily lead to semantic drift and indiscriminative positive samples. 2) The constructed negative samples are unreliable because they ignore important clustering information. To solve these problems, we propose a Cluster-guided Contrastive deep Graph Clustering network (CCGC) that mines the intrinsic supervision information in high-confidence clustering results. Specifically, instead of conducting complex node or edge perturbation, we construct two views of the graph by designing special Siamese encoders whose weights are not shared between the sibling sub-networks. Then, guided by the high-confidence clustering information, we carefully select and construct the positive samples from the same high-confidence cluster in the two views. Moreover, to construct semantically meaningful negative sample pairs, we regard the centers of different high-confidence clusters as negative samples, thus improving the discriminative capability and reliability of the constructed sample pairs. Lastly, we design an objective function that pulls samples from the same cluster close while pushing away those from other clusters, by maximizing and minimizing the cross-view cosine similarity between positive and negative samples. Extensive experimental results on six datasets demonstrate the effectiveness of CCGC compared with existing state-of-the-art algorithms.
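The objective described above can be sketched as a single scalar loss: maximize cross-view cosine similarity of samples in the same high-confidence cluster, minimize similarity to the other clusters' centers. A toy version (the exact loss composition is an illustrative assumption, not CCGC's published objective):

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def ccgc_loss(z1, z2, labels, centers):
    """Cluster-guided contrastive loss sketch.

    z1, z2:  cross-view embeddings of the same nodes, shape (N, d).
    labels:  high-confidence cluster assignment per node.
    centers: cluster centers used as reliable negative samples.
    Minimizing this pulls cross-view positives together and pushes each
    embedding away from the centers of the other clusters.
    """
    pos = np.mean([cosine(z1[i], z2[i]) for i in range(len(z1))])
    neg = np.mean([cosine(z1[i], centers[c])
                   for i in range(len(z1))
                   for c in range(len(centers)) if c != labels[i]])
    return neg - pos
```

Using cluster centers rather than arbitrary other nodes as negatives avoids accidentally repelling two nodes that actually belong to the same cluster.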
As one of the prevalent methods for achieving automation systems, Imitation Learning (IL) presents promising performance in a wide range of domains. However, despite considerable improvement in policy performance, the corresponding research on the explainability of IL models is still limited. Inspired by recent approaches in explainable artificial intelligence, we propose a model-agnostic explaining framework for IL models called R2RISE. R2RISE aims to explain the overall policy performance with respect to the frames in demonstrations. It iteratively retrains the black-box IL model from randomly masked demonstrations and uses conventional evaluation outcomes (environment returns) as coefficients to build an importance map. We also conducted experiments to investigate three major questions concerning frames' importance equality, the effectiveness of the importance map, and connections between importance maps from different IL models. The results show that R2RISE successfully distinguishes important frames in the demonstrations.
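The importance map construction follows the RISE recipe: sample random binary masks over the demonstration frames, score each masked demonstration by the retrained policy's environment return, and average the masks weighted by those returns. A minimal sketch (here `eval_fn` is a stand-in for the expensive retrain-and-evaluate step, and the keep probability is an illustrative assumption):

```python
import numpy as np

def importance_map(num_frames, num_masks, eval_fn, keep_prob=0.5, seed=0):
    """RISE-style frame importance for a demonstration of num_frames frames.

    eval_fn(mask) stands in for: retrain the IL policy on the frames kept by
    the binary mask, then return the resulting environment return. Frames
    that tend to appear in high-return masks accumulate high importance.
    """
    rng = np.random.default_rng(seed)
    total = np.zeros(num_frames)    # return-weighted mask accumulator
    weight = np.zeros(num_frames)   # how often each frame was kept
    for _ in range(num_masks):
        mask = (rng.random(num_frames) < keep_prob).astype(float)
        r = eval_fn(mask)
        total += r * mask
        weight += mask
    return total / np.maximum(weight, 1e-12)  # normalized importance per frame
```

With a toy `eval_fn` whose return depends only on whether frame 2 is kept, frame 2 receives the highest importance, as expected.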
Text clustering and topic extraction are two important tasks in text mining. Usually, these two tasks are performed separately. For topic extraction to facilitate clustering, we can first project texts into a topic space and then run a clustering algorithm to obtain clusters. To promote topic extraction by clustering, we can first obtain clusters with a clustering algorithm and then extract cluster-specific topics. However, this naive strategy ignores the fact that text clustering and topic extraction are strongly correlated and follow a chicken-and-egg relationship. Performing them separately fails to make them mutually benefit each other toward the best overall performance. In this paper, we propose an unsupervised text clustering and topic extraction framework (ClusTop) which integrates text clustering and topic extraction into a unified framework, achieving high-quality clustering results while simultaneously extracting topics from each cluster. Our framework includes four components: enhanced language model training, dimensionality reduction, clustering, and topic extraction, where the enhanced language model can be viewed as a bridge between clustering and topic extraction. On one hand, it provides text embeddings with a strong cluster structure, facilitating effective text clustering; on the other hand, thanks to its self-attention architecture, it attends strongly to topic-related words for topic extraction. Moreover, the training of the enhanced language model is unsupervised. Experiments on two datasets demonstrate the effectiveness of our framework and provide benchmarks for different model combinations in this framework.
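Once clusters are available, the cluster-specific topic-extraction step can be sketched crudely as picking each cluster's most salient terms. A toy stand-in using raw term frequency (the real framework uses the enhanced language model's self-attention rather than counting; `clustop_topics` and its arguments are hypothetical names for illustration):

```python
from collections import Counter

def clustop_topics(docs, labels, top_k=2):
    """Per-cluster topic extraction sketch.

    docs:   raw text documents.
    labels: cluster id assigned to each document (from any clustering step).
    Returns the top_k most frequent terms per cluster as its 'topic' -- a
    crude frequency-based stand-in for attention-weighted topic words.
    """
    topics = {}
    for c in set(labels):
        counts = Counter(word
                         for doc, label in zip(docs, labels) if label == c
                         for word in doc.lower().split())
        topics[c] = [word for word, _ in counts.most_common(top_k)]
    return topics
```

The unified framework's point is precisely that this step and the clustering step should inform each other through the shared language model, rather than run once in sequence as here.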